Understanding HyperLogLog

HyperLogLog is an algorithm for estimating the cardinality of multisets. Notably, it only uses \(O(1)\) space. Join me on a journey to understand it by implementing it!


Estimating the number of unique elements in a set or stream has many applications, especially in databases. For example, if such estimates are faster or cheaper to obtain than the true value, they can feed into join ordering, where the cost of a table is derived from its estimated row count. The original paper describes HyperLogLog - an algorithm that uses only \(O(1)\) space and \(O(n)\) time. It’s a very cool algorithm, and I recommend reading the paper, though there were definitely aspects of it that I struggled to build an intuition for, so I decided to implement it myself as a learning exercise.

The HyperLogLog Algorithm

I highly recommend reading this write-up by Meta about how HyperLogLog works. It helped me get a lot of intuition for the process, but here’s my short summary:

HyperLogLog works by mapping elements into a roughly uniformly distributed space using a hash function. If this space were between 0 and 1, we could use the inverse of the lowest element as an approximation of the number of elements (if you have \(n\) elements evenly distributed in \((0, 1)\), the lowest element will be around \(\frac{1}{n}\)). We can get a value in \((0, 1)\) by counting the number of leading 0’s (\(k\)) in every hashed value and then using \(\frac{1}{2^{k}}\) as our value - we only care about the lowest value, so we look for the element that maximizes \(k\). However, this would only give us coarse, power-of-two sized estimates. We could do better by using many different hash functions and averaging the values estimated by each one, but that is computationally expensive, so instead we take some number of bits \(b\) from the prefix of the hashed value and use them as an identifier to split the input into \(m = 2^{b}\) streams. The remaining bits are used as before to estimate the cardinality.
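
To make the bucketing concrete, here's a rough sketch of that split (not the full algorithm). The 32-bit width, the use of Python's built-in hash, and the helper name are just illustrative assumptions:

def bucket_and_rank(value: str, b: int = 4):
    """Split a hashed value into a register index and a 'first 1' position."""
    # Render the hash as a fixed-width binary string (32 bits, for illustration)
    bits = format(abs(hash(value)) % (1 << 32), '032b')
    j = int(bits[:b], 2)  # the first b bits pick one of m = 2**b registers
    rest = bits[b:]
    # Position of the first 1 in the remaining bits (treat all-zeros as len + 1)
    rank = rest.index('1') + 1 if '1' in rest else len(rest) + 1
    return j, rank

print(bucket_and_rank("foobar42"))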

One interesting property to note is that HyperLogLog “remembers” seen elements by virtue of its use of max. The key idea is that max is commutative and idempotent (\(\max(x, x) = x\)), which means that adding the same element twice is the same as adding it once, and since commutativity means the order of operations doesn’t matter, the order in which repeated elements arrive doesn’t matter either. This is mainly relevant here in terms of testing, since it simplifies the kind of test data I’ll need to produce.
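
As a toy illustration of that property (a standalone sketch, not part of the algorithm's code), folding values into a register with max gives the same result regardless of duplicates or ordering:

register = 0
for rank in [3, 5, 3, 2, 5, 5]:  # duplicates, arbitrary order
    register = max(register, rank)
print(register)  # 5 - same as if each distinct value had been seen exactly once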

Writing my implementation

Once I’d read the initial paper, I set about trying to implement it myself. The algorithm itself is very straightforward, but what was even cooler was that GitHub’s Copilot seemed to understand what I was doing and suggested almost exactly what I wanted! However, I was a little wary of relying on the auto-complete too heavily, since I wanted the experience of writing things out myself to learn better. For my choice of hash function, I went with Python’s built-in hash. Since most of the paper relies on manipulating bits, the actual expression I used was bin(abs(hash(x)))[2:], which takes the absolute value of the hash, converts it to a binary string, and removes the 0b prefix. However, running my initial code produced wildly inaccurate results. Commence debugging!
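
For example (keeping in mind that Python salts string hashes per process unless PYTHONHASHSEED is set, so the exact bits vary between runs and the output shown is purely illustrative):

x = "foobar42"
bits = bin(abs(hash(x)))[2:]
print(bits)  # e.g. '1011000...' - the length varies, which is why the final code pads/truncates to 32 bits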

Debugging my implementation

My first step in the debugging process was to find a reference implementation to compare against. I quickly found this repo and began comparing. This repo seems to include a lot of optimizations that the original paper doesn’t mention, so I tried ignoring most of those, since this exercise is mostly about getting the core ideas down. Scrolling through the code seemed to indicate that I was generally on the right track, which was encouraging.

The most obvious difference to me was that the reference implementation tries to ignore registers that haven’t received any elements (empty streams are skipped); however, that didn’t affect my implementation, since every register was being filled. To be safe, I added a debug assert to make sure I wasn’t hitting that case.

My first suspicion was the choice of hash function. At work we’d seen strange behavior with the distribution of Python’s hash function, albeit only on older Python versions. Looking at the reference implementation, they used sha1 instead. Could that really make a difference in the computation? To verify this, I injected code into the reference and my own implementation to produce a histogram of results from calls to the rho function, which is responsible for computing the length of the 0-prefix of the hashed data.

Here’s what my rho function looked like with this additional collection:

class HyperLogLog:
    ...

    observed_ps = {}
    def rho(self, s: str) -> int:
        v = s.index('1') + 1
        self.observed_ps[v] = self.observed_ps.get(v, 0) + 1
        return v
...

h = HyperLogLog()
# Simulate 100_000 unique values
for x in range(100000):
    h.hyperloglog(f"foobar{x}")

print({k: h.observed_ps[k] for k in sorted(h.observed_ps.keys())})

And here’s some sample values produced:

{
    1: 50102,
    2: 25102,
    3: 12383,
    4: 6265,
    5: 3093,
    6: 1465,
    7: 779,
    8: 387,
    9: 210,
    10: 105,
    11: 53,
    12: 37,
    13: 9,
    14: 3,
    15: 4,
    16: 2,
    18: 1
}

Similar modifications were made to the reference implementation. This turned out to be a great idea, and confirmed that my hash function was producing values that closely matched the distribution in the reference implementation. I was also concerned that the bit width of the values I was working with wasn’t large enough, but the histogram seemed to show that having at least 20 bits was more than sufficient for the desired error rate. This makes sense, since the probability of encountering values with a specific 0-prefix length decreases exponentially with that length.
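
To put rough numbers on that: for a uniformly distributed hash, the probability that a value’s first 1 bit lands in position \(k\) is \(2^{-k}\), so among \(10^5\) values we expect about \(10^5 \cdot 2^{-17} \approx 0.76\) values with \(k = 17\) and essentially none past 20 - which matches the tail of the histogram above.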

After some more poking around and comparing intermediate values from my solution against the reference, I stumbled upon the bug. For some reason, Copilot had implemented the sum term in the harmonic mean as \(2^{\sum{2^{-M_j}}}\) instead of \(\sum{2^{-M_j}}\). It goes to show that trusting AIs in domains where I’m not already well-informed is a bad idea. Lessons learned.
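
For reference, the estimator the paper prescribes (and what the corrected code below computes) is \(E = \alpha_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-M_j}\right)^{-1}\), where the \(M_j\) are the per-register maxima and \(\alpha_m\) is a bias-correction constant.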

While this brought the estimates to a more reasonable value, the algorithm still seemed to be underestimating by a factor of 2. This factor seemed suspiciously consistent, and the algorithm already includes a constant-factor adjustment, so I concluded that the general steps of the algorithm must be mostly correct. After reviewing the reference solution again, I realized that my implementation of rho was off by one, and it was easy to see from the original equation that this would cause the estimate to be halved overall.
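
Concretely, if every register value comes out one too small, each term in the sum doubles: \(\sum_j 2^{-(M_j - 1)} = 2 \sum_j 2^{-M_j}\), and since that sum sits in the denominator of the estimator, the estimate is exactly halved.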

The working code is presented below:

My code

Here’s the final code I ended up with, including some comments.

import math
from typing import List

class HyperLogLog:
    error_rate = 0.1
    # Determine the number of "buckets" based on the error_rate
    b = int(math.ceil(math.log((1.04 / error_rate) ** 2, 2)))
    m = 2**b
    # Registers storing Max-values for each "bucket"
    M: List[float]

    def __init__(self) -> None:
        # Initialize registers to 0 (-inf in the paper)
        self.M = [0] * self.m

    def rho(self, s: str) -> int:
        """Find the first 1 bit"""
        v = s.index('1') + 1
        return v

    def h(self, s: str) -> str:
        """Compute the hash of s and return it as a binary string"""
        x = abs(hash(s))
        s = bin(x)[2:]
        # pad to 32 bits
        while len(s) < 32:
            s = '0' + s
        # truncate to 32 bits
        s = s[-32:]
        return s

    def add(self, s: str):
        """Add an element to the estimator"""
        x = self.h(s)
        j = int(x[:self.b], 2)
        w = x[self.b:]
        self.M[j] = max(self.M[j], self.rho(w))

    def z(self) -> float:
        """The sum portion (denominator) of the harmonic mean"""
        return sum(2**-x for x in self.M)

    def get_alpha(self):
        """Constant factor to weight results"""
        # This is copied from https://github.com/svpcom/hyperloglog
        if self.b == 4:
            return 0.673

        if self.b == 5:
            return 0.697

        if self.b == 6:
            return 0.709

        return 0.7213 / (1.0 + 1.079 / self.m)


    def estimate(self):
        """Estimate the cardinality of all elements seen so far"""
        # constant * m * harmonic mean of 2**-M
        return self.get_alpha() * (self.m**2) / self.z()

h = HyperLogLog()
for x in range(100000):
    h.add(f"foobar{x}")

# The implementation above doesn't handle unvisited registers well
# assert h.M.count(0) == 0, f"num registers/0 regs = {len(h.M)}/{h.M.count(0)}"
print("Estimate:", h.estimate())

Playing with HyperLogLog

Once I had a working implementation, I was able to experiment with a few things. I tried using \(k\) hash functions instead of partitioning the input into \(k\) streams. This was done with the following logic:

for i in range(len(self.M)):
    self.M[i] = max(self.M[i], self.rho(self.h(f'hash_{i}: {s}')))

This performed worse (and was insanely slow), tending to overestimate far more than the original implementation. I want to continue exploring this and see if maybe it’s just a matter of adjusting the constants involved.

I also took inspiration from older algorithms like LogLog: I wanted to see what would happen if I only used the lowest 70% of register values instead of all values in the harmonic mean (to bias away from outliers). This was done by changing the z function to:

def z(self) -> float:
    def lower_70_percent():
        return sorted(self.M)[:int(self.m * 0.7)]
    return sum(2**-x for x in lower_70_percent())

and multiplying by \(0.7\) in the final computation. This performed way worse than the original implementation, but if I removed the \(0.7\) multiplier at the end, it only slightly underestimated compared to the original.

Thoughts

This was a fun exercise, and implementing my own version of the algorithm greatly improved my understanding of the concepts involved. There are definitely parts of the proof that invoke concepts from statistics that I’m not super familiar with anymore, but brushing up on some of those was also a good experience. I’m glad I got something working, and being able to tweak the algorithm and see how things change was also interesting. I would definitely recommend this process to anyone exploring a new area!

Written on March 19, 2024